-
-
Notifications
You must be signed in to change notification settings - Fork 379
Feature: support rectilinear chunk grid extension #3534
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Feature: support rectilinear chunk grid extension #3534
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3534 +/- ##
==========================================
+ Coverage 60.94% 61.65% +0.71%
==========================================
Files 86 86
Lines 10268 10769 +501
==========================================
+ Hits 6258 6640 +382
- Misses 4010 4129 +119
🚀 New features to boost your workflow:
|
…ature/rectilinear-chunk-grid
|
|
||
|
|
||
| @dataclass(frozen=True) | ||
| class RectilinearChunkGrid(ChunkGrid): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thoughts on just calling this class Rectilinear, and renaming the RegularChunkGrid to Regular? We could keep around a RegularChunkGrid class for compatibility. But I feel like people know these are chunk grids when they import them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
50/50. I think the more descriptive class is useful when looking at a tracebacks. Plus, this is currently in .core so its not meant to be used directly by users.
| @given(data=st.data()) | ||
| async def test_basic_indexing(data: st.DataObject) -> None: | ||
| zarray = data.draw(simple_arrays()) | ||
| @given(data=st.data(), zarray=st.one_of([simple_arrays(), complex_chunked_arrays()])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the search space for the standard arrays strategy is so large, i made a different one complex_chunked_arrays that purely checks different chunk grids
with simple_arrays() we are only spending 10% of our time trying RectilinearChunkGrid so using this approach. We should boost number of examples too.
| 2. **Not compatible with sharding**: You cannot use variable chunking together with | ||
| the sharding feature. Arrays must use either variable chunking or sharding, but not both. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope this is a temporary limitation! There's a natural extension of rectilinear chunk grids to rectilinear shard grids.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Co-authored-by: Davis Bennett <[email protected]>
|
is it intentional that the chunk grid is import zarr
store = zarr.storage.MemoryStore()
arr = zarr.create_array(
store,
shape=(60, 100),
chunks=[[20, 20, 20], [25, 25, 25, 25]],
zarr_format=3,
dtype="uint8"
)
arr.metadata.chunk_grid # RectilinearChunkGridOr is Not sure how much this matters in practice ( |
@keewis - yes, this is intentional. We decided this was preferable because it will allow the user to add variable lengths later (through an extend/append workflow). |
CodSpeed Performance ReportMerging this PR will not alter performanceComparing Summary
|
|
|
||
|
|
||
| @dataclass(frozen=True) | ||
| class ResolvedChunkSpec: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this need to be a class? It has no methods. Seems like either a TypedDict or just a tuple would work
| shards: tuple[int, ...] | None | ||
|
|
||
|
|
||
| def _validate_zarr_format_compatibility( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need this here? Can't we check when we construct the array metadata document?
| ) | ||
|
|
||
|
|
||
| def _validate_sharding_compatibility( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this be defined over in the array v3 metadata module, and used there to check that the array metadata document is valid?
| ) | ||
|
|
||
|
|
||
| def _validate_data_compatibility( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this need to be stand-alone function? are we going to use this logic anywhere other than it's current usage site (resolve_chunk_spec)
| @@ -436,6 +468,21 @@ def to_dict(self) -> dict[str, JSON]: | |||
| return out_dict | |||
|
|
|||
| def update_shape(self, shape: tuple[int, ...]) -> Self: | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO this needs to take a parameter that can define the new chunks (which can default to None or some other sentinel flagging "default behavior" semantics)
|
FYI I tested this PR for implementing rechunking with variable-sized intermediate chunks in Cubed - and it worked! I found one wrinkle in that |
…arr-python into feature/rectilinear-chunk-grid
…ature/rectilinear-chunk-grid
|
I will give this a spin today, but assuming everything works, my question is: how can we ship this soon (this week?) while making it clear that the feature is experimental? |
|
I have varient chunked array; Happy to experiment to write it using this experimental feature(if it works with xarray...) |
Summary
Adds support for RectilinearChunkGrid extension (Zarr v3), enabling arrays with variable chunk sizes per dimension.
Closes: #1595 | Replaces: #1483 | Related: zarr-extensions#25
Key Features
RectilinearChunkGrid
from_array())[[10, 6]]= 6 chunks of size 10Chunk Grid Access
.chunksProperty BehaviorDesign Decisions
ChunksLikeas TypeAliasResolvedChunkSpecas frozen dataclass.chunksraises for rectilinear.chunk_gridRemoved from Earlier Designs
RegularChunks/RectilinearChunkstuple subclasseschunks.lat)ChunksTypeABC hierarchyDeferred / TODO Items
update_shape()optional chunks parametermetadata/v3.py:483chunk_grids.py:1513-1593Review Focus Areas
High Priority:
chunk_grids.py:RectilinearChunkGridclass,ChunksLiketype, RLE expansion/compression,resolve_chunk_spec()metadata/v3.py:update_shape()for rectilinear resize behaviorindexing.py: Variable chunk indexing with binary searcharray.py:.chunksproperty behavior,.chunk_gridpropertyTests:
test_chunk_grids/test_rectilinear.py: Comprehensive unit teststest_chunk_grids/test_rectilinear_integration.py: End-to-end scenariostesting/strategies.py: Hypothesis strategies for property-based testingBreaking Changes
None. Fully backward compatible.
TODO:
docs/user-guide/*.mdchanges/